Introduction

Are you a fan of soccer? Have you ever heard of the century competition between Leo Messi and Cristiano Rolnaldo? Are you interested in numerically finding out which player is better than the others? This is the target of my project, which is predicting player ratings from their match performances.

Captain America

Since this is a supervised learning project, people might wonder, do you have a “test answer” for this kind of question. The answer is YES!! FIFA(Fédération Internationale De Football Association, aka the international soccer association, collaborated with EA gaming and produced a monumental game called FIFA. Each year, these two partners will come up with a new version of FIFA with better motion, texture, lighting, shadow detail and so on, to provide the gamers all around the world with a brand new experience in soccer. Starting from 1993 to now, FIFA and EA have developed from their first generation of FIFA International Soccer, which released in 1993, to FIFA 23, which was released by the end of 2022.

(On the left is the FIFA International Soccer, and on the right is FIFA23)

FIFA International Soccer FIFA 23

From this game, each player are rewarded with a overall rating which is calculated and watched by professional analyst from FIFA by using some mistry algorithms. Some people might doubt the credibility of this stats, because they are mainly used in the gaming. However, I see no concern of it, because the way they collected their data. Sometimes, the analysts will come to the matches and look at the match stats for every player, and also there is time when analysts will come to the training base for most of the clubs and provide some task for the player to do so that they will have a good and accurate estimate of the overall rating. This is why players who might be sitting on bench or injured for the most of the season can also get a reasonable rating after all.

Below are the examples of some top players rating and other minor attributes like shooting, passing, defending index for gamers to know better which area on the field is this particular player best of.

FIFA23 cards

Aren’t you excited about this? Let’s get started!

Loading Packages & Datasets

Fifa21 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_22/players_22.csv")
Fifa20 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_21/players_21.csv")
Fifa19 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_20/players_20.csv")
Fifa18 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_19/players_19.csv")
Fifa17 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_18/players_18.csv")
Fifa16 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_17/players_17.csv")
Fifa21 <- Fifa21 %>% select(c("long_name","short_name","overall"))
Fifa20 <- Fifa20 %>% select(c("long_name","short_name","overall"))
Fifa19 <- Fifa19 %>% select(c("long_name","short_name","overall"))
Fifa18 <- Fifa18 %>% select(c("long_name","short_name","overall"))
Fifa17 <- Fifa17 %>% select(c("long_name","short_name","overall"))
Fifa16 <- Fifa16 %>% select(c("long_name","short_name","overall"))


PlayerStats21 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats21.csv")
PlayerStats20 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats20.csv")
PlayerStats19 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats19.csv")
PlayerStats18 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats18.csv")
PlayerStats17 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats17.csv")
PlayerStats16 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats16.csv")

head(Fifa19) %>%
  arrange(desc(overall))
##                             long_name        short_name overall
## 1      Lionel Andrés Messi Cuccittini          L. Messi      94
## 2 Cristiano Ronaldo dos Santos Aveiro Cristiano Ronaldo      93
## 3       Neymar da Silva Santos Júnior         Neymar Jr      92
## 4                         Eden Hazard         E. Hazard      91
## 5                     Kevin De Bruyne      K. De Bruyne      91
## 6                           Jan Oblak          J. Oblak      91

Here are some top rated players in the FIFA21.

Tidying Data Sets

Before we even start building our models, there is a extremely serious problem of our data sets. Since I found my response variable and predictors from two different websites, when I was trying to combine or left join them into one common data set, the NAME column which I’m using in my left join will not work well because these two websites sometimes call the same player with different name.

For example, for the same player Ederson, who is a Brazilian Goalkeeper and plays for Brazil National Team and Manchester City, are called by his full name – Ederson Santana de Moraes – in one website and just Ederson in another. As you might see, the other Website are not just playing players by their First nor their Last name, but by their common or most familiar name for the fans.

This problem really gives me a headache, I had to figure out a way that could be used for most of our observations.

Data_Combine <- function(Fifa,PlayerStats){
  nwords <- function(string, pseudo=F){
    ifelse( pseudo, 
            pattern <- "\\S+", 
            pattern <- "[[:alpha:]]+" 
          )
    str_count(string, pattern)
  }

  split_names_PlayerStats <- function(df) {
    df$first_name <- NA
    df$last_name <- word(df$Player.Name, 1)
    for (i in 1:nrow(df)) {
      if (nwords(df$Player.Name[i]) == 2) {
        df$first_name[i] <- word(df$Player.Name[i], 1)
        df$last_name[i] <- word(df$Player.Name[i], 2)
      } else if (nwords(df$Player.Name[i]) == 3) {
        df$first_name[i] <- word(df$Player.Name[i], 1)
        df$last_name[i] <- paste(word(df$Player.Name[i], start = 2), ... = word(df$Player.Name[i], 3), sep = " ")
      } else if (nwords(df$Player.Name[i]) == 1) {
        df$last_name[i] <- word(df$Player.Name[i], 1)
      }
    }
    return(df)
  }

  split_names_Fifa <- function(df) {
    df$first_name <- NA
    df$last_name <- word(df$short_name, 1)
    for (i in 1:nrow(df)) {
      if (nwords(df$short_name[i]) == 2) {
        df$first_name[i] <- word(df$short_name[i], 1)
        df$last_name[i] <- word(df$short_name[i], 2)
      } else if (nwords(df$short_name[i]) == 3) {
        df$first_name[i] <- word(df$short_name[i], 1)
        df$last_name[i] <- paste(word(df$short_name[i], start = 2), word(df$short_name[i], 3), sep = " ")
      } else if (nwords(df$short_name[i]) == 1) {
        df$last_name[i] <- word(df$short_name[i], 1)
      }
    }
   return(df)
  }

  get_initials <- function(strings_vector) {
    initials <- character(length(strings_vector))
    for (i in 1:length(strings_vector)) {
      if (is.na(strings_vector[i])) {
        initials[i] <- NA
      } else if (nchar(strings_vector[i]) > 1) {
        initials[i] <- substr(strings_vector[i], 1, 1)
      } else {
        initials[i] <- strings_vector[i]
      }
    }
   return(initials)
  }



  PlayerStats <- split_names_PlayerStats(PlayerStats)
  PlayerStats$first_name <- get_initials(PlayerStats$first_name)
  Fifa <- split_names_Fifa(Fifa)
  Fifa$first_name <- get_initials(Fifa$first_name)
  

  remove_duplicates_by_name <- function(df, first_name_col = "first_name", last_name_col = "last_name") {
    unique_df <- df[!duplicated(df[c(first_name_col, last_name_col)]), ]
    return(unique_df)
  }


  PlayerStats <- remove_duplicates_by_name(PlayerStats)
  Fifa <- remove_duplicates_by_name(Fifa)



  join_data_frames <- function(df1, df2, first_name_col = "first_name", last_name_col = "last_name") {
    merged_df <- merge(df1, df2, by.x = c(first_name_col, last_name_col), by.y = c(first_name_col, last_name_col), all.x = TRUE)
    cleaned_df <- na.omit(merged_df)
    return(cleaned_df)
  }

  Data<-join_data_frames(Fifa,PlayerStats)
  
  return(Data)
}

Data21 <- Data_Combine(Fifa21,PlayerStats21)
Data20 <- Data_Combine(Fifa20,PlayerStats20)
Data19 <- Data_Combine(Fifa19,PlayerStats19)
Data18 <- Data_Combine(Fifa18,PlayerStats18)
Data17 <- Data_Combine(Fifa17,PlayerStats17)
Data16 <- Data_Combine(Fifa16,PlayerStats16)

Soccer <- rbind(Data16,Data17,Data18,Data19,Data20,Data21) %>% 
  select(-c(first_name,last_name,short_name, long_name, Player.Name,Team,GP), #
         -c(Y,YR,R), #Yellow and Red cards does not really matter in ratings
         -c(SOG,ATTDR,BCM,BCS,TOFF),
         -c(SA,AOP,ACR,BCC,ASP,CCSP,CCOP,ACRO,POSL),
         -c(TKLW,LMT,LPOPP,PENT,CLROL,EG,ES,OWN))%>%
  filter(.,POS != "G")

Soccer %>%
  head()
##    overall League POS  MIN G A  S INT CR CC BLK TKL   P AW BR DR  DW TOUCH TBOX
## 3       80   LIGA   D  878 0 0  2  32  1  2   5  18 284 17 42  2  41   497    5
## 5       72   BUND   M 1321 1 0  9  34  1  7   7  42 489 32 61  3 100   859    9
## 10      74   SERI   M 1071 2 0 16  28 10  7   1  18 483 16 99 17  68   779   14
## 12      73   SERI   D  867 0 0  8  15 54  9   3  18 226 14 48 10  52   600    7
## 13      75    EPL   M  121 0 0  4   1  9  1   0   5  24  0  6  2   9    69    7
## 15      69   LIGA   F  230 0 0  1   2  4  5   1  11  68  3 19  2  25   141    4
##    PFT DL AFZP BLKCR PK ALB ATB
## 3   24 30   70     3  0  14   0
## 5   97 93  233     4  0  28   0
## 10 127 76  281     1  0  32   1
## 12  61 48  127    10  0  15   0
## 13   6 18   14     0  0   1   0
## 15  25 22   44     2  0   2   0

Captain America Oh my gosh that’s a huge chunk of code right there and finally my job of tidying and combining data is done. I have narrowed down the data files for all players rating from the Top 7 leagues from 2016-2021 to just one, which I have named “Soccer” in our terminal. As a soccer fan for more than 15 years, I have excluded several predictors that seem irrelevant to a player’s overall rating, such as their names, the number of games they have played, the Red/Yellow cards they have received, and own goals. Additionally, I have excluded some predictors that are highly correlated with existing ones. For example, while I included the predictor “SHOOT,” which indicates the number of shots completed, I excluded “SOG,” which indicates shots on goal, since these two predictors are highly related and derived from each other. Moving forward with model building, I have chosen to exclude them from my Soccer file. Furthermore, I have decided to exclude all goalkeepers from our analysis since most of the predictors are basically irrelevant for them.

Visual EDA

dim(Soccer)
## [1] 13863    26
vis_miss(Soccer)

summary(Soccer)
##     overall        League              POS                 MIN      
##  Min.   :47.0   Length:13863       Length:13863       Min.   :   0  
##  1st Qu.:67.0   Class :character   Class :character   1st Qu.: 383  
##  Median :72.0   Mode  :character   Mode  :character   Median :1194  
##  Mean   :71.7                                         Mean   :1276  
##  3rd Qu.:76.0                                         3rd Qu.:2061  
##  Max.   :94.0                                         Max.   :3787  
##        G                A                S               INT       
##  Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:  3.00   1st Qu.:  2.0  
##  Median : 1.000   Median : 0.000   Median : 10.00   Median : 11.0  
##  Mean   : 1.905   Mean   : 1.335   Mean   : 17.74   Mean   : 16.7  
##  3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 24.00   3rd Qu.: 26.0  
##  Max.   :41.000   Max.   :21.000   Max.   :208.00   Max.   :112.0  
##        CR               CC              BLK              TKL        
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:  1.00   1st Qu.:  2.00   1st Qu.: 0.000   1st Qu.:  6.00  
##  Median :  8.00   Median :  7.00   Median : 2.000   Median : 18.00  
##  Mean   : 25.64   Mean   : 13.19   Mean   : 4.419   Mean   : 23.55  
##  3rd Qu.: 33.00   3rd Qu.: 19.00   3rd Qu.: 6.000   3rd Qu.: 36.00  
##  Max.   :382.00   Max.   :136.00   Max.   :61.000   Max.   :146.00  
##        P                AW              BR               DR        
##  Min.   :   0.0   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 104.0   1st Qu.:  4.0   1st Qu.: 18.00   1st Qu.:  2.00  
##  Median : 358.0   Median : 13.0   Median : 60.00   Median :  7.00  
##  Mean   : 475.7   Mean   : 22.5   Mean   : 72.35   Mean   : 12.79  
##  3rd Qu.: 724.5   3rd Qu.: 31.0   3rd Qu.:114.00   3rd Qu.: 18.00  
##  Max.   :2903.0   Max.   :322.0   Max.   :393.00   Max.   :185.00  
##        DW             TOUCH             TBOX             PFT        
##  Min.   :  0.00   Min.   :   0.0   Min.   :  0.00   Min.   :   0.0  
##  1st Qu.: 21.00   1st Qu.: 224.0   1st Qu.:  4.00   1st Qu.:  24.0  
##  Median : 66.00   Median : 721.0   Median : 16.00   Median :  80.0  
##  Mean   : 74.88   Mean   : 844.7   Mean   : 29.58   Mean   : 113.8  
##  3rd Qu.:116.00   3rd Qu.:1326.0   3rd Qu.: 39.00   3rd Qu.: 167.0  
##  Max.   :403.00   Max.   :3908.0   Max.   :337.00   Max.   :1069.0  
##        DL              AFZP          BLKCR             PK         
##  Min.   :  0.00   Min.   :   0   Min.   : 0.00   Min.   : 0.0000  
##  1st Qu.: 23.00   1st Qu.:  57   1st Qu.: 0.00   1st Qu.: 0.0000  
##  Median : 65.00   Median : 188   Median : 1.00   Median : 0.0000  
##  Mean   : 74.99   Mean   : 246   Mean   : 2.53   Mean   : 0.2357  
##  3rd Qu.:112.00   3rd Qu.: 366   3rd Qu.: 3.00   3rd Qu.: 0.0000  
##  Max.   :441.00   Max.   :1707   Max.   :42.00   Max.   :15.0000  
##       ALB              ATB         
##  Min.   :  0.00   Min.   : 0.0000  
##  1st Qu.:  4.00   1st Qu.: 0.0000  
##  Median : 18.00   Median : 0.0000  
##  Mean   : 32.05   Mean   : 0.6297  
##  3rd Qu.: 47.00   3rd Qu.: 1.0000  
##  Max.   :301.00   Max.   :31.0000
Soccer$overall <- as.numeric(Soccer$overall)

Soccer$POS <- factor(Soccer$POS)
Soccer$League <- factor(Soccer$League)

Good thing is there is no missing data at all in our data. And This is a fair amount of observations and I’m quite confident about the process given the fact that the more observation, the more accuracy we can obtain.

Correlation Plot

Soccer %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type="upper", tl.cex=0.75)

As we can see, there are a lot of positive correlations including some really strong ones. For example, Min played has some serious correlations across all predictors. This makes sense because the more mins any play spends on the field, his amount of touches, passes and even goals will significantly increase.

ggplot(data=Soccer,mapping = aes(x=forcats::fct_infreq(League))) +
geom_bar(fill = "Orange", color = "black") + 
labs(title="Numbers of Players in Each League",x = "League") +
theme(panel.background = element_rect(fill = "#f0f0f0"),
panel.grid.minor = element_blank(), 
axis.title = element_text(size = 12)) 

ggplot(data=Soccer,mapping = aes(x=forcats::fct_infreq(POS))) +
geom_bar(fill = "Dark Green", color = "black") + 
labs(title = "Number of Players in Each Position",x = "Position") +
theme(panel.background = element_rect(fill = "#f0f0f0"),
panel.grid.minor = element_blank(), 
axis.title = element_text(size = 12)) 

ggplot(data=Soccer, aes(x="", y= MIN)) +
  geom_boxplot(fill="orange", color="black") +
  labs(title="Mins played Distribution", y="Minutes Played") +
  theme_minimal()

summary(Soccer$MIN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     383    1194    1276    2061    3787
Soccer <- Soccer[Soccer$MIN >= 382,]

This box plot tells me a surprising fact that majority of soccer players even they are playing in top leagues, they would spend a significant amount of time sitting on bench. This plot is highly skewed to the right with with 25 percentile of the players play less than 500 mins for the entire 2021 season, which is approximately 5 games given total about 60 games possible for any professional soccer team in a regular season. That’s sad isn’t it :(

Therefore, I have to exclude the lower 25% of the observations by mins played because their match performance are most likely not reflective to their real abilities.

ggplot(data=Soccer, aes(x=TOUCH, y=P, fill=POS, group = POS)) +
  geom_boxplot() +
  labs(title="Relationship Between Touches and Passes by Position", x="Touches", y="Passes") +
  theme_minimal()

ggplot(Soccer, aes(P)) + 
  geom_bar() +
  labs(title="Numbers of Passes Distribution") +
  xlim(0, 2500) +
  ylim(0, 30) +
  scale_fill_manual()
## Warning: Removed 13 rows containing non-finite values (`stat_count()`).

ggplot(Soccer, aes(x=P, y=TOUCH)) +
 geom_jitter(width = 0.5, size = 1) +
  geom_smooth(method = "lm", se =F, col="Green") +
  labs(title = "Touches vs. Passes")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Soccer, aes(x=MIN, y=TOUCH)) +
 geom_jitter(width = 0.5, size = 1) +
  geom_smooth(method = "lm", se =F, col="Green") +
  labs(title = "Minutes Played vs. Touches")
## `geom_smooth()` using formula = 'y ~ x'

As we can see here, there is a strong positive linear relationship between touches and passes, touches and mins played. These are important notes we need to consider when we set up the recipe, otherwise it will ruin the effectiveness of our models.

ggplot(data = Soccer, aes(x = POS, y = S, fill = TOUCH)) +
  geom_violin(width = 0.7, color = "Darkgreen") +
  stat_summary(fun = median, geom = "point", shape = 21, size = 3, color = "white") +
  scale_y_continuous(limits = c(0, 100)) +
  labs(title = "Distribution of Touches inside the Box vs. Shoot by Position", x = "Position", y = "Number of Shoot") +
  theme_minimal()
## Warning: Removed 125 rows containing non-finite values (`stat_ydensity()`).
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 125 rows containing non-finite values (`stat_summary()`).

There is also a strong correlation between touches inside the box and shoot, and, furthermore, Forward and Forward/Midfield players have a higher number of touches inside the box and shoot comparing with players of other positions.

In conclusion, it is the fact that we have a lot of correlated variables in our predictors and all of the relationships make sense. In our recipe set up, I have to take into account of these correlations in it, otherwise this could seriously affect our results.

Initial Split & Setting up Recipe

set.seed(6656)
Soccer_split <- initial_split(Soccer,strata = overall,prop = 3/4)
Soccer_test <- testing(Soccer_split)
Soccer_train <- training(Soccer_split)



Soccer_recipe <- recipe(overall ~ . , data=Soccer_train)%>%
  step_dummy(all_nominal_predictors()) %>%
  step_center(all_nominal_predictors()) %>%
  step_scale(all_nominal_predictors()) %>%
  step_interact(terms = ~ P:TOUCH) %>%
  step_interact(terms = ~ TBOX:S) %>%
  step_interact(terms = ~ MIN:TOUCH)

After initial spliting, setting recipe is probably one of the most important step of my project overall. Making sure factors,like League and Position, are dummy coded. Capturing any correlations between predictors in our recipe set up can improve the accuracy of our model and avoid collinearity. Based in our correlation plot, we can see there are some variables such as S and TBOX, TOUCH and P. Ignoring such correlation could cause the model to be inaccurate. Lastly, I centered and scaled all the nominal predictors.

K-Fold Cross Validation

Soccer_folds <- vfold_cv(Soccer_train, v = 5, strata = overall)

Using K-fold cross validation is important because the performance of a model on a single dataset can be highly dependent on the particular data points included in that dataset, and may not generalize well to other datasets. By using k-fold cross-validation, we can get a more reliable estimate of a model’s performance, since it is evaluated on multiple partitions of the data, rather than just one.

K-fold cross-validation can also help to identify issues such as overfitting, where the model performs well on the training data but poorly on new, unseen data. By evaluating the model on different folds of the data, we can get a better sense of how well the model is likely to generalize to new data.

Overall, k-fold cross-validation is a useful technique for evaluating and comparing machine learning models, and is often used in research and in practice to help ensure that models are robust and reliable. We saved our folded data into Soccer_folds.

Model Building

Now, after setting up a satisfactory recipe and cross validation, it is time to build our models. In the following steps, I will conduct 8 different models, including Linear Regression, K-nearest Neighbors, Elastic Net, Ridge Regression, Lasso Regression, Boosted Trees, and Random Forest. In addition, Root Mean Square Error – RMSE – will become my metric when it comes to measure the accuracy of my models. In another words, when training our models in Soccer_folds, the one has the lowest RMSE will become our best model. In the end I will preceed with two best models in our final fitting just to see how result could change from using the best or the second best model.

Furthermore, because many of the models need some sort of tuning, meaning that we are testing different levels and range of those hyperparameters and see which one gives us the best results, using autoplot function in R enables us to observe the impact of every tuned parameter on each model’s performance. The autoplots illustrate the model’s performance through its RMSE score, where lower scores indicate better performance.

Let’s see what happens.

Linear Regression

lm_reg <- linear_reg() %>% 
  set_engine("lm")

lm_workflow <- workflow() %>% 
  add_model(lm_reg) %>% 
  add_recipe(Soccer_recipe)

lm_fit <- lm_workflow %>% 
  fit_resamples(resamples = Soccer_folds)

best_lm <- collect_metrics(lm_fit)%>%
  slice(1);best_lm
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard    4.23     5  0.0698 Preprocessor1_Model1

K-Nearest Neighbors & Autoplot

knn_mod <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("regression") %>%
  set_engine("kknn")

knn_workflow <- workflow() %>%
  add_model(knn_mod) %>%
  add_recipe(Soccer_recipe)



neighbors_grid_knn <- grid_regular(neighbors(range = c(1, 10)), levels = 10)

tune_res_knn <- tune_grid(
  object = knn_workflow, 
  resamples = Soccer_folds, 
  grid = neighbors_grid_knn,
  control = control_grid(verbose = TRUE)
)
## i Fold1: preprocessor 1/1
## ✓ Fold1: preprocessor 1/1
## i Fold1: preprocessor 1/1, model 1/1
## ✓ Fold1: preprocessor 1/1, model 1/1
## i Fold1: preprocessor 1/1, model 1/1 (predictions)
## i Fold2: preprocessor 1/1
## ✓ Fold2: preprocessor 1/1
## i Fold2: preprocessor 1/1, model 1/1
## ✓ Fold2: preprocessor 1/1, model 1/1
## i Fold2: preprocessor 1/1, model 1/1 (predictions)
## i Fold3: preprocessor 1/1
## ✓ Fold3: preprocessor 1/1
## i Fold3: preprocessor 1/1, model 1/1
## ✓ Fold3: preprocessor 1/1, model 1/1
## i Fold3: preprocessor 1/1, model 1/1 (predictions)
## i Fold4: preprocessor 1/1
## ✓ Fold4: preprocessor 1/1
## i Fold4: preprocessor 1/1, model 1/1
## ✓ Fold4: preprocessor 1/1, model 1/1
## i Fold4: preprocessor 1/1, model 1/1 (predictions)
## i Fold5: preprocessor 1/1
## ✓ Fold5: preprocessor 1/1
## i Fold5: preprocessor 1/1, model 1/1
## ✓ Fold5: preprocessor 1/1, model 1/1
## i Fold5: preprocessor 1/1, model 1/1 (predictions)
collect_metrics(tune_res_knn)
## # A tibble: 20 × 7
##    neighbors .metric .estimator  mean     n std_err .config              
##        <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1         1 rmse    standard   5.88      5  0.0800 Preprocessor1_Model01
##  2         1 rsq     standard   0.263     5  0.0150 Preprocessor1_Model01
##  3         2 rmse    standard   5.45      5  0.0785 Preprocessor1_Model02
##  4         2 rsq     standard   0.301     5  0.0150 Preprocessor1_Model02
##  5         3 rmse    standard   5.14      5  0.0793 Preprocessor1_Model03
##  6         3 rsq     standard   0.335     5  0.0153 Preprocessor1_Model03
##  7         4 rmse    standard   4.94      5  0.0812 Preprocessor1_Model04
##  8         4 rsq     standard   0.360     5  0.0158 Preprocessor1_Model04
##  9         5 rmse    standard   4.82      5  0.0823 Preprocessor1_Model05
## 10         5 rsq     standard   0.377     5  0.0160 Preprocessor1_Model05
## 11         6 rmse    standard   4.73      5  0.0820 Preprocessor1_Model06
## 12         6 rsq     standard   0.390     5  0.0161 Preprocessor1_Model06
## 13         7 rmse    standard   4.67      5  0.0808 Preprocessor1_Model07
## 14         7 rsq     standard   0.400     5  0.0159 Preprocessor1_Model07
## 15         8 rmse    standard   4.63      5  0.0793 Preprocessor1_Model08
## 16         8 rsq     standard   0.407     5  0.0156 Preprocessor1_Model08
## 17         9 rmse    standard   4.59      5  0.0777 Preprocessor1_Model09
## 18         9 rsq     standard   0.413     5  0.0152 Preprocessor1_Model09
## 19        10 rmse    standard   4.57      5  0.0767 Preprocessor1_Model10
## 20        10 rsq     standard   0.418     5  0.0150 Preprocessor1_Model10
autoplot(tune_res_knn)

best_neighbors_knn <- select_by_one_std_err(tune_res_knn, desc(neighbors), metric = "rmse");best_neighbors_knn
## # A tibble: 1 × 9
##   neighbors .metric .estimator  mean     n std_err .config          .best .bound
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            <dbl>  <dbl>
## 1        10 rmse    standard    4.57     5  0.0767 Preprocessor1_M…  4.57   4.64

Elastic Net & Autoplot

en_mod <- linear_reg(mixture = tune(), 
                              penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

en_workflow <- workflow() %>%
  add_recipe(Soccer_recipe) %>%
  add_model(en_mod)

en_grid <- grid_regular(penalty(range = c(0, 1),
                                     trans = identity_trans()),
                        mixture(range = c(0, 1)),
                             levels = 10)

tune_res_en <- tune_grid(
  en_workflow,
  resamples = Soccer_folds, 
  grid = en_grid
)

collect_metrics(tune_res_en)
## # A tibble: 200 × 8
##    penalty mixture .metric .estimator  mean     n std_err .config               
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
##  1   0           0 rmse    standard   4.26      5  0.0712 Preprocessor1_Model001
##  2   0           0 rsq     standard   0.488     5  0.0138 Preprocessor1_Model001
##  3   0.111       0 rmse    standard   4.26      5  0.0712 Preprocessor1_Model002
##  4   0.111       0 rsq     standard   0.488     5  0.0138 Preprocessor1_Model002
##  5   0.222       0 rmse    standard   4.26      5  0.0712 Preprocessor1_Model003
##  6   0.222       0 rsq     standard   0.488     5  0.0138 Preprocessor1_Model003
##  7   0.333       0 rmse    standard   4.27      5  0.0710 Preprocessor1_Model004
##  8   0.333       0 rsq     standard   0.486     5  0.0137 Preprocessor1_Model004
##  9   0.444       0 rmse    standard   4.28      5  0.0709 Preprocessor1_Model005
## 10   0.444       0 rsq     standard   0.483     5  0.0137 Preprocessor1_Model005
## # … with 190 more rows
autoplot(tune_res_en)
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis

best_en <- select_by_one_std_err(tune_res_en,
                                metric = "rmse", penalty, mixture);best_en
## # A tibble: 1 × 10
##   penalty mixture .metric .estimator  mean     n std_err .config    .best .bound
##     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>      <dbl>  <dbl>
## 1       0       0 rmse    standard    4.26     5  0.0712 Preproces…  4.23   4.30

Ridge Regression & Autoplot

ridge_reg <- linear_reg(mixture = 0, 
                         penalty = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

ridge_workflow <- workflow() %>% 
  add_recipe(Soccer_recipe) %>% 
  add_model(ridge_reg)

penalty_grid <- grid_regular(penalty(range = c(-5,5)), levels = 20)

tune_res_ridge <- tune_grid(
  ridge_workflow,
  resamples = Soccer_folds,
  grid = penalty_grid
)
## ! Fold1: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold2: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold3: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold4: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold5: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
collect_metrics(tune_res_ridge)
## # A tibble: 40 × 7
##      penalty .metric .estimator  mean     n std_err .config              
##        <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.00001   rmse    standard   4.26      5  0.0712 Preprocessor1_Model01
##  2 0.00001   rsq     standard   0.488     5  0.0138 Preprocessor1_Model01
##  3 0.0000336 rmse    standard   4.26      5  0.0712 Preprocessor1_Model02
##  4 0.0000336 rsq     standard   0.488     5  0.0138 Preprocessor1_Model02
##  5 0.000113  rmse    standard   4.26      5  0.0712 Preprocessor1_Model03
##  6 0.000113  rsq     standard   0.488     5  0.0138 Preprocessor1_Model03
##  7 0.000379  rmse    standard   4.26      5  0.0712 Preprocessor1_Model04
##  8 0.000379  rsq     standard   0.488     5  0.0138 Preprocessor1_Model04
##  9 0.00127   rmse    standard   4.26      5  0.0712 Preprocessor1_Model05
## 10 0.00127   rsq     standard   0.488     5  0.0138 Preprocessor1_Model05
## # … with 30 more rows
autoplot(tune_res_ridge)

best_ridge <- select_by_one_std_err(tune_res_ridge,
                                metric = "rmse", penalty);best_ridge
## # A tibble: 1 × 9
##   penalty .metric .estimator  mean     n std_err .config            .best .bound
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>              <dbl>  <dbl>
## 1 0.00001 rmse    standard    4.26     5  0.0712 Preprocessor1_Mod…  4.26   4.33

Lasso Regression & Autoplot

lasso_reg <- linear_reg(penalty = tune(), 
                         mixture = 1) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

lasso_workflow <- workflow() %>% 
  add_recipe(Soccer_recipe) %>% 
  add_model(ridge_reg)

penalty_grid <- grid_regular(penalty(range = c(-5,5)), levels = 20)

tune_res_lasso <- tune_grid(
  lasso_workflow,
  resamples = Soccer_folds,
  grid = penalty_grid
)
## ! Fold1: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold2: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold3: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold4: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold5: internal:
##   There were 4 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 17: `penalty = 2636.651`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
collect_metrics(tune_res_lasso)
## # A tibble: 40 × 7
##      penalty .metric .estimator  mean     n std_err .config              
##        <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.00001   rmse    standard   4.26      5  0.0712 Preprocessor1_Model01
##  2 0.00001   rsq     standard   0.488     5  0.0138 Preprocessor1_Model01
##  3 0.0000336 rmse    standard   4.26      5  0.0712 Preprocessor1_Model02
##  4 0.0000336 rsq     standard   0.488     5  0.0138 Preprocessor1_Model02
##  5 0.000113  rmse    standard   4.26      5  0.0712 Preprocessor1_Model03
##  6 0.000113  rsq     standard   0.488     5  0.0138 Preprocessor1_Model03
##  7 0.000379  rmse    standard   4.26      5  0.0712 Preprocessor1_Model04
##  8 0.000379  rsq     standard   0.488     5  0.0138 Preprocessor1_Model04
##  9 0.00127   rmse    standard   4.26      5  0.0712 Preprocessor1_Model05
## 10 0.00127   rsq     standard   0.488     5  0.0138 Preprocessor1_Model05
## # … with 30 more rows
autoplot(tune_res_lasso)

best_lasso <- select_by_one_std_err(tune_res_lasso,
                                metric = "rmse", penalty);best_lasso
## # A tibble: 1 × 9
##   penalty .metric .estimator  mean     n std_err .config            .best .bound
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>              <dbl>  <dbl>
## 1 0.00001 rmse    standard    4.26     5  0.0712 Preprocessor1_Mod…  4.26   4.33

Random Forest & Autoplot

rf_spec <- rand_forest(mtry = tune(), 
                       trees = tune(), 
                       min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

rf_workflow <- workflow() %>% 
  add_recipe(Soccer_recipe) %>% 
  add_model(rf_spec)

rf_parameter_grid <- grid_regular(mtry(range = c(4, 8)), trees(range = c(200,800)), min_n(range = c(30,50)), levels = 8)

rf_tune_res <- tune_grid(
  rf_workflow,
  resamples = Soccer_folds,
  grid = rf_parameter_grid
)

collect_metrics(rf_tune_res)
## # A tibble: 640 × 9
##     mtry trees min_n .metric .estimator  mean     n std_err .config             
##    <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
##  1     4   200    30 rmse    standard   4.35      5  0.0690 Preprocessor1_Model…
##  2     4   200    30 rsq     standard   0.477     5  0.0140 Preprocessor1_Model…
##  3     5   200    30 rmse    standard   4.32      5  0.0683 Preprocessor1_Model…
##  4     5   200    30 rsq     standard   0.483     5  0.0135 Preprocessor1_Model…
##  5     6   200    30 rmse    standard   4.30      5  0.0643 Preprocessor1_Model…
##  6     6   200    30 rsq     standard   0.485     5  0.0125 Preprocessor1_Model…
##  7     7   200    30 rmse    standard   4.28      5  0.0673 Preprocessor1_Model…
##  8     7   200    30 rsq     standard   0.487     5  0.0131 Preprocessor1_Model…
##  9     8   200    30 rmse    standard   4.27      5  0.0691 Preprocessor1_Model…
## 10     8   200    30 rsq     standard   0.487     5  0.0133 Preprocessor1_Model…
## # … with 630 more rows
autoplot(rf_tune_res)

best_rf <- select_by_one_std_err(rf_tune_res, metric = "rmse",mtry,trees,min_n)

Boosted Trees & Autoplot

bt_reg <- boost_tree(mtry = tune(), 
                           trees = tune(), 
                           learn_rate = tune()) %>%
  set_engine("xgboost") %>% 
  set_mode("regression")

bt_reg_workflow <- workflow() %>% 
  add_model(bt_reg) %>% 
  add_recipe(Soccer_recipe)

bt_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        learn_rate(range = c(-10, -1)),
                        levels = 5)

bt_tune_res <- tune_grid(
  bt_reg_workflow, 
  resamples = Soccer_folds, 
  grid = bt_grid
)
## ! Fold1: internal:
##   There were 5 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold2: internal:
##   There were 5 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold3: internal:
##   There were 5 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold4: internal:
##   There were 5 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold5: internal:
##   There were 5 warnings in `dplyr::summarise()`.
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
##   The first warning was:
##   ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
##     = na_rm)`.
##   ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
##   Caused by warning:
##   ! A correlation computation is required, but `estimate` is constant an...
##   ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
autoplot(bt_tune_res)

show_best(bt_tune_res, n = 1)
## Warning: No value of `metric` was given; metric 'rmse' will be used.
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1     3   200        0.1 rmse    standard    4.25     5  0.0568 Preprocessor1_M…
best_bt <- select_by_one_std_err(bt_tune_res, metric = "rmse",mtry,trees,learn_rate)

Model Results

It is time to check how our models do comparing with each other!

Overall_results <- bind_rows(best_lm, best_neighbors_knn, best_en, best_ridge, best_lasso, best_rf, best_bt) %>% 
  tibble() %>% 
  mutate(model = c("Linear Regression", "K-nearest Neighbors", "Elastic net", "Ridge Regression", "Lasso Regression", "Random Forest", "Boosted Trees")) %>% 
  arrange(mean)%>%
  select(model,mean) %>%
  mutate(mean = round(mean, 5))


Overall_results
## # A tibble: 7 × 2
##   model                mean
##   <chr>               <dbl>
## 1 Linear Regression    4.23
## 2 Elastic net          4.26
## 3 Ridge Regression     4.26
## 4 Lasso Regression     4.26
## 5 Boosted Trees        4.27
## 6 Random Forest        4.32
## 7 K-nearest Neighbors  4.57

Winning It looks like the Linear Regression gives us the best rmse and become our most ideal model to use on the testing data set. This also makes sense because I have captured many potential collinearity in our predictors during the step of recipe set up, as we having a sense that there are a fair amount of correlation between our predictor in our correlation plot, Linear Regression finally give us the best results. Now, we proceed to the final fitting step. Allmost There!

Final Fitting

lm_final_workflow_train <- finalize_workflow(lm_workflow, best_lm)
rf_final_fit_train <- fit(lm_final_workflow_train, data = Soccer_train)

Soccer_tibble <- predict(rf_final_fit_train, new_data = Soccer_test %>% select(-overall))
Soccer_tibble <- bind_cols(Soccer_tibble, Soccer_test %>% select(overall))

rmse(Soccer_tibble, truth = overall, .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.21

Based on our calculated rmse, it is out of my expectation since it has done a even better job compare to the training data. In addition, since the observations are ranged from 0 to 100, the Linear Regression model hasn’t done a bad job.

Soccer_tibble %>% 
  ggplot(aes(x = .pred, y = overall)) +
  geom_point(alpha = 0.4) +
  geom_abline(lty = 2) +
  theme_grey() +
  coord_obs_pred() +
  labs(title = "Predicted Values vs. Actual Values")

Winning Woah! That’s some very expressive result right there. If most of our points are on the diagnal axis on this predicted values vs. actual values graph, that means we have reached a pretty decent precision in predicting the actual values.

Conclusion

In conclusion, My Machine Learning project has been a resounding success in achieving a high level of precision using the Linear Regression model. I successfully designed, developed, and deployed a robust and accurate prediction model that could handle the complexities of our dataset.

One of the key factors in the success was the careful preprocessing of my data sets. I spent a significant amount of time cleaning, formatting and combining the datasets to ensure that it was sufficient and suitable for analysis.

After preprocessing, I experimented with different models and techniques to determine the best approach for my dataset. The Linear Regression model was the clear winner, outperforming other popular models such as random forests and support vector machines. I then fine-tuned the parameters of the model to achieve the lowest possible Root Mean Square Error (RMSE), which is a standard metric used to evaluate the accuracy of regression models.

Looking ahead, I believe that further research and experimentation with different models and techniques can improve the performance of my system and enhance its applicability. In summary, I am incredibly proud and thankful of the work I have accomplished on this project, especially the help from the professor Doctor Coburn. I am excited about the possibilities that Machine Learning can offer in the future, and I’m looking forward to continuing exploring in this exciting field. Captain America